Protein Engineering, Design and Selection
Oxford University Press (OUP)
Preprints posted in the last 30 days, ranked by how well they match Protein Engineering, Design and Selection's content profile, based on 14 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.
Kim, Y.; Kwon, H.; Hong, J.; Kang, C. K.; Park, W. B.; Kim, H.-R.; Lee, C.-H.
Background: Combinatorial fragment antigen-binding (Fab) libraries encode an immense heavy-light chain pairing space, often exceeding 10{superscript 1} possible combinations, which far surpasses the diversity that can be experimentally constructed and screened in display systems. As a result, direct Fab screening samples only a small fraction of the theoretical search space, creating a practical bottleneck for functional binder discovery. Results: Here, we frame Fab discovery as a staged search problem by decoupling heavy-chain (HC) and light-chain (LC) exploration. We implemented a sequential HC preselection-remating workflow in yeast surface display, in which antigen-reactive HC variants are first enriched and subsequently recombined with a diverse LC repertoire to reconstruct a focused Fab library. In a SARS-CoV-2 spike-targeted campaign, HC and LC libraries of 2.05 x 10 and 2.33 x 10 members corresponded to a theoretical pairing space of approximately 4.8 x 10{superscript 1} combinations. Sequential HC enrichment followed by LC remating allowed recovery of multiple functional Fab clones from a tractable library scale of approximately 10, including clones that shared a common HC scaffold but carried distinct LC partners. A representative recombinant IgG output showed broad but heterogeneous spike/RBD binding, measurable pseudovirus neutralization activity (EC = 11.1 nM), and compatibility with standard early biophysical characterization after full-length IgG reformatting. Conclusions: These results provide proof of principle that combinatorial Fab discovery can be approached as a staged exploration problem under realistic library-size constraints. By focusing downstream Fab reconstruction on an antigen-compatible HC subspace, sequential HC preselection followed by LC remating offers a practical strategy for exploring otherwise intractable antibody pairing landscapes in eukaryotic display systems.
Trapote Fernandez, A.; Fernandez, A.; Mendez-Liter, J. A.; Prieto, A.; Barriuso, J.; Osorio, F. G.
β-galactosidases (BGs) are essential enzymes widely used in the food industry, particularly in the production of lactose-free products. Among them, the BG from Aspergillus oryzae is of industrial relevance due to its activity at acidic pH and moderate thermal tolerance. However, enhancing its catalytic performance remains a key challenge. Traditional enzyme engineering methods are time-consuming and resource-intensive, limiting their scalability. Recent advances in Artificial Intelligence (AI), particularly those based on Natural Language Processing, offer a promising alternative by enabling efficient exploration of protein sequence space and prediction of beneficial mutations. In this study, we introduce an ensemble-based, zero-shot Protein Language Model pipeline that reconciles predictions from six independent models (ESM2 and the five ESM1v variants) combined with a diversity-aware candidate selection strategy. Applied to the BG from A. oryzae, this approach identified beneficial mutations leading to novel enzyme variants with up to a four-fold increase in catalytic efficiency on oNPGal, a two-fold increase on lactose, and, independently, a T338I variant with markedly enhanced thermostability (≈80% residual activity after 24 h at 60 °C), all without requiring supervised fine-tuning on experimental fitness data. Our results demonstrate that consensus across an ensemble of PLMs can efficiently enrich beneficial substitutions in industrially relevant enzymes and substantially reduce the number of wet-lab candidates that need to be screened.
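To make the idea of ensemble consensus with diversity-aware selection concrete, here is a minimal sketch. It is not the authors' pipeline: the abstract does not say how the six models are reconciled, so rank averaging is an assumption, as is the "one substitution per position" diversity rule. The mutation names and scores are hypothetical, and only three toy models are shown for brevity.

```python
# Hypothetical zero-shot scores (higher = more favourable) from three toy models.
# Mutation names and values are invented for illustration only.
MODEL_SCORES = [
    {"A10V": -0.2, "A10L": -0.5, "G25S": -1.0, "T40I": -3.0},
    {"A10V": -0.4, "A10L": -0.3, "G25S": -0.9, "T40I": -2.0},
    {"A10V": -0.1, "A10L": -0.6, "G25S": -0.5, "T40I": -2.5},
]


def rank_scores(scores):
    """Map each mutation to its rank within one model (0 = best score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {mutation: rank for rank, mutation in enumerate(ordered)}


def consensus_ranking(model_scores):
    """Order mutations by mean rank across models (assumed reconciliation rule)."""
    per_model_ranks = [rank_scores(scores) for scores in model_scores]
    mutations = list(model_scores[0])
    mean_rank = {
        m: sum(ranks[m] for ranks in per_model_ranks) / len(per_model_ranks)
        for m in mutations
    }
    return sorted(mutations, key=lambda m: mean_rank[m])


def pick_diverse(ranked, k):
    """Take the top k mutations, allowing at most one per sequence position."""
    chosen, seen_positions = [], set()
    for mutation in ranked:
        position = int(mutation[1:-1])  # "A10V" -> 10
        if position in seen_positions:
            continue
        chosen.append(mutation)
        seen_positions.add(position)
        if len(chosen) == k:
            break
    return chosen


ranked = consensus_ranking(MODEL_SCORES)
selected = pick_diverse(ranked, k=2)
print(ranked, selected)
```

Rank averaging is used here because raw log-likelihood scores from different models are not on a common scale; other reconciliation rules (voting, z-score averaging) would fit the same skeleton.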
Leaf, C. M.; Qi, P.; Gandhi, Y. P.; Jalali-Yazdi, F.; Ong, J. N.; Takahashi, T. T.; Kalia, R.; Roberts, R. W.
In vitro selection and directed evolution technologies, such as mRNA display, explore large libraries (≥10^14 variants) and generate thousands to millions of functional polypeptide ligands to a variety of targets. Denoising diffusion implicit machine learning models (DDIMs) trained using display-derived deep sequencing data can greatly expand these functional sequences beyond what is accessible experimentally. However, methods are needed to predict peptide properties such as binding free energies (ΔG°). Here, we applied machine learning methods to predict binding free energies of both experimental and DDIM-generated peptide ligands against a target of interest, the oncogenic protein Bcl-xL. To do this, we trained a Closed-form Continuous (CfC) neural network using a dataset of 15,700 peptide ligands where pairs of sequences and their corresponding binding free energies (ΔG°) were used as inputs. This type of model was chosen due to its ability to represent irregular series. The resulting CfC model accurately predicts the rank order, within error, and binding free energies (ΔG°) for both experimental and DDIM-generated peptides, identifying five DDIM-generated peptides with single-digit picomolar affinities. Combining trained DDIM and CfC models offers a unified route to expand the scope of experimental ligand discovery and predict the molecular properties of both experimental and generated ligands, and highlights the utility of large quantitative datasets for making accurate in silico predictions of high-affinity peptide candidates. Statement: High-throughput sequencing analysis of mRNA display libraries enables generating novel peptide ligands and expands the scope of functional sequences beyond what is accessible experimentally. 
Closed-form Continuous neural networks trained using sequences and their corresponding free energies accurately predict the binding free energies of both experimental and machine learning-generated peptides, enabling a route to quantitatively predict peptide properties using directed evolution data.
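For readers who want to relate the binding free energies discussed above to the picomolar affinities mentioned, the standard thermodynamic relation ΔG° = RT ln(Kd / 1 M) can be sketched as follows. The temperature of 298.15 K is an assumption (the abstract does not state assay conditions), and the 5 pM input is an illustrative value, not a reported measurement.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_from_kd(kd_molar, temp_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant in M.

    Uses dG = RT ln(Kd / 1 M); temp_k defaults to an assumed 298.15 K.
    """
    return R_KCAL * temp_k * math.log(kd_molar)


def kd_from_dg(dg_kcal, temp_k=298.15):
    """Inverse relation: dissociation constant (M) from dG in kcal/mol."""
    return math.exp(dg_kcal / (R_KCAL * temp_k))


# Illustrative input: a hypothetical 5 pM binder.
dg_5pm = dg_from_kd(5e-12)
print(f"{dg_5pm:.2f} kcal/mol")
```

Under these assumptions a single-digit picomolar Kd corresponds to roughly -15 kcal/mol, and each tenfold change in Kd shifts ΔG° by about 1.4 kcal/mol.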
Gingrich, P. W.; Biswas, A.; Mica, I. L.; Brammer, K. M.; Shu, Z.; Maxwell, D. S.; Russell, K. P.; Al-Lazikani, B.
Summary: Reliable structure-based prediction of small-molecule druggability is hindered by a fundamental labeling problem. Experimentally confirmed liganded sites (positives) are observable, but credible "undruggable" pockets (negatives) are almost impossible to define. Standard supervised machine learning consequently relies on arbitrary definitions of undruggable, leading to bias and false negatives. Here we introduce PocketBagger, a positive-unlabeled (PU) learning framework for pocket druggability prediction trained exclusively on experimentally determined Protein Data Bank1 (PDB) structures. PocketBagger uses PU bagging to learn key features associated with reliable druggable pockets and considers all remaining pockets in the structurally characterized proteome as unlabeled. We demonstrate the capability of PocketBagger through the training of a simple Random Forest classifier and demonstrate its power in recall (0.804), even when challenged with increasingly difficult generalizability assessments and entire protein-family hold outs. We benchmark and demonstrate the added value of PU learning by comparing PocketBagger to a leading deep-learning predictor. However, PocketBagger is intended to be used as a framework for any model architecture. Along with the code, the data generated by PocketBagger are deployed in canSAR.ai, providing scalable, generalizable pocket druggability predictions to the drug discovery community.
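The PU bagging idea described above can be illustrated with a small, dependency-free sketch. This is not PocketBagger's code: the abstract trains a Random Forest, whereas the base learner below is a nearest-centroid rule swapped in to keep the example self-contained, and the two-dimensional "pocket features" are invented toy values. In each round a bootstrap sample of unlabeled points is treated as provisional negatives, and only out-of-bag unlabeled points are scored.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag 'looks positive' votes for each unlabeled point."""
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    pos_centroid = centroid(positives)
    for _ in range(n_rounds):
        # Bootstrap (with replacement) as many unlabeled points as positives
        # and treat them as provisional negatives for this round.
        bag = [rng.randrange(len(unlabeled)) for _ in range(len(positives))]
        in_bag = set(bag)
        neg_centroid = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # score only out-of-bag points
            closer_to_pos = sq_dist(x, pos_centroid) < sq_dist(x, neg_centroid)
            totals[i] += 1.0 if closer_to_pos else 0.0
            counts[i] += 1
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# Toy data (hypothetical features): labeled positives cluster near (1, 1).
positives = [(1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.0, 0.8)]
# Unlabeled set: the first two are hidden positives, the rest resemble negatives.
unlabeled = [(0.95, 1.05), (1.05, 0.95),
             (0.0, 0.0), (0.1, -0.1), (-0.1, 0.1),
             (0.2, 0.0), (0.0, 0.2), (-0.2, -0.1)]

scores = pu_bagging_scores(positives, unlabeled)
print([round(s, 2) for s in scores])
```

Because no point is ever declared a true negative, the averaged score is best read as a ranking of unlabeled pockets rather than a calibrated probability.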
Bellaiche, A.; Choudhary, P.; Nair, S.; Harrus, D.; Yu, C. W.-H.; Tanweer, S. A.; Evans, G. L.; Lo, S. W.; Martin, M.; Fleming, J. R.; Velankar, S.
Structure Integration with Function, Taxonomy and Sequences (SIFTS) provides residue-level mappings between UniProt Knowledgebase sequences and Protein Data Bank structures and has historically been generated through internal Protein Data Bank in Europe (PDBe) pipelines. Here, PDBe-SIFTS is presented as a fully open-source, locally deployable implementation of this mapping framework. The pipeline combines fast, scalable sequence search using MMseqs2, an improved bounded scoring scheme for ranking candidate mappings, and residue-level mapping refinement based on backbone connectivity. PDBe-SIFTS is distributed as a Python package with command-line tools for 1) building a sequence search database, 2) identifying the best sequence-structure match, 3) one-to-one mapping at the residue level, and 4) generating SIFTS annotations in PDBx/mmCIF format. Benchmarking on the complete Protein Data Bank archive showed that MMseqs2 reduced archive-scale UniProtKB searches from hours with BLASTP to minutes, approximately 22-36 times faster, while curated mappings were recovered at top rank in 93.1% of cases. The remaining discrepancies mainly involved biologically ambiguous cases such as highly conserved proteins, chimeric constructs, or closely related orthologs. These results show that PDBe-SIFTS enables fast mapping, improving structural coherence in residue-level alignments while delivering the most up-to-date and accurate mappings, comparable to expert curation. Tool: https://github.com/PDBeurope/SIFTS Quick start notebook with example: https://github.com/PDBeurope/SIFTS/tree/master/notebooks Broader audience statement: Matching protein sequences to their three-dimensional structures, and mapping annotations across both, is essential for understanding protein function, interactions, and molecular mechanisms. This integrated view enables richer interpretation of biological data and underpins advances in drug discovery, disease research, and protein engineering. 
PDBe-SIFTS provides an open and functional framework for structure-sequence mapping, allowing researchers and databases to run, inspect, and extend these mappings locally, while benefiting from faster searches, transparent scoring, and structurally informed residue-level alignments.
Talpir, I.; Fleishman, S. J.
Computational protein design demands generally applicable models that reliably predict or generate unmeasured variants with superior functional properties. Although protein language models (pLMs) have been used in zero-shot and transfer-learning design studies, they have generally not been assessed in benchmarks that explicitly test combinatorial extrapolation from lower- to higher-order variants. Here we benchmark widely used pLMs against conventional baseline methods in recently described dense, experimentally validated multi-mutant landscapes. We find that regardless of architecture and parameter count, pLMs are statistically similar to one another, and none consistently outperforms conventional baseline methods. Furthermore, their ability to distinguish functional from non-functional variants in zero-shot prediction is comparable to that of conventional homology-based methods. We suggest that to contribute significantly to the design of protein function, pLMs may need to encode biophysical and structural priors or be combined with structure-based approaches.
Gomez Aquino, I.; Ghahremanzamaneh, M.; Tsopanoglou, A.; Blanco, A.; Carillo, S.; Bones, J.; Jimenez del Val, I.
β4-galactosylation is a critical quality attribute of therapeutic monoclonal antibodies (mAbs), enhancing complement-dependent cytotoxicity, antibody-dependent cytotoxicity, and antibody-dependent cellular phagocytosis. Despite its therapeutic importance, galactosylation remains the most variable glycosylation motif due to its sensitivity to cell culture conditions. Here, we describe a dual genetic engineering strategy applied to two mAb-producing CHO cell lines, DP12 and VRC01, to simultaneously overcome the cellular machinery and metabolic bottlenecks that limit β4-galactosylation. The first engineering event knocks out COSMC, the chaperone required for core 1 β-1,3-galactosyltransferase 1 activity, to redirect UDP-Gal consumption from O-linked β3-galactosylation towards mAb Fc N-linked β4-galactosylation. The second event overexpresses β-1,4-galactosyltransferase 1 (β4GalT1) to augment cellular galactosylation machinery. Each modification was characterised individually (COSMC- and GalT+) and in combination (C-/GT+) across both cell lines in batch and fed-batch cultures. The combined C-/GT+ strategy consistently achieved greater than 90% mAb Fc β4-galactosylation, irrespective of host cell line or culture mode. Metabolic characterisation confirmed that both engineering events alleviate their respective bottlenecks: COSMC knockout redirects UDP-Gal flux and β4GalT1 overexpression increases N-galactosylation capacity. The C-/GT+ strategy also reduced production of Man5 glycans, which accelerate serum clearance and pose immunogenicity risks. Metabolic profiling suggests that the COSMC knockout attenuates UTP consumption and contributes to reduced Man5 production. C-/GT+ glycoengineering had no negative impact on mAb titre. 
Our results establish the C-/GT+ dual glycoengineering strategy as a robust approach for consistently achieving high mAb galactosylation across diverse cell culture conditions, with the additional benefit of reduced Man5 glycans. Highlights: Dual COSMC KO and β4GalT1 overexpression achieves >90% mAb Fc galactosylation. COSMC KO redirects UDP-Gal from O-glycans to mAb Fc without impacting cell growth. Dual glycoengineering reduces production of undesired Man5 glycans.
Kim, J.; Romero, P. A.
Large language models (LLMs) are increasingly deployed as agents for scientific discovery, but standardized frameworks for evaluating their performance and behaviour in scientific workflows are lacking. Protein design provides a demanding test case because modern workflows combine stochastic generative models, structure prediction systems, and physics-based evaluation tools that require extensive candidate exploration and filtering. Here we introduce BioDesignBench, a benchmark of 76 expert-curated protein design tasks spanning antibodies, enzymes, fluorescent proteins, binders, and scaffolds, together with human and non-LLM baselines and behavioural metrics derived from tool-use traces. We evaluate four frontier LLM agents across diverse protein design workflows and find that the strongest agents surpass deterministic hardcoded pipelines but consistently underperform expert practice. Although agents generally select appropriate tools, they evaluate candidate designs too shallowly, rarely compare alternatives, and terminate exploration prematurely. Guided workflows improve tool coverage but not evaluation depth. Enforcing deeper multi-metric evaluation substantially improves agent performance, demonstrating that these limitations are behavioural rather than fundamental capability constraints. We release BioDesignBench, open-source reference agents, and a public leaderboard as a community resource for evaluating and improving AI agents for protein engineering.
Fieux-Castagnet, A.; Waton, J.; Glukhonemykh, A.; Snow, E.; Ashokkumar, R.; Fleming, J.; Champagne, D.; Devenyns, T.; Peluffo, A.; Anagnostopoulos, C.
Protein structure prediction models (such as AlphaFold, Chai, Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding from non-binding pairs at scale. To our knowledge, there has not been an exhaustive exploration of Boltz-2 inference settings on this high impact problem, and in this paper we set out to describe and implement a novel benchmarking framework that can accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and a randomly assigned non-cognate antigen. We developed a novel evaluation framework that systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.5 < ROC-AUC < 0.60). Among evaluated metrics, the minimum value of the interface predicted TM-score (ipTM-min) across seed-samples captured the strongest signal. Interestingly, additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ~15-20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ~8 seconds per prediction compared to shallow MSA settings. 
Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization, while confirming that modest amounts of sampling can be helpful, but not in itself sufficient to improve performance significantly as gains plateau relatively quickly.
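As an illustration of how a seed-aggregated metric such as ipTM-min can be turned into a ROC-AUC for cognate versus non-cognate pairs, consider the sketch below. It is not the authors' framework: the per-seed ipTM values are invented, and the ROC-AUC is computed through its rank-based (Mann-Whitney) definition rather than with any particular library.

```python
def iptm_min(seed_scores):
    """Aggregate per-seed ipTM values for one complex by taking the minimum."""
    return min(seed_scores)


def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual ROC-AUC definition.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical per-seed ipTM values (three seeds per antibody-antigen pair).
cognate_runs = [[0.62, 0.55, 0.70], [0.40, 0.48, 0.45], [0.81, 0.77, 0.79]]
noncognate_runs = [[0.50, 0.35, 0.44], [0.60, 0.58, 0.66], [0.30, 0.42, 0.38]]

cognate_min = [iptm_min(run) for run in cognate_runs]
noncognate_min = [iptm_min(run) for run in noncognate_runs]
auc = roc_auc(cognate_min, noncognate_min)
print(f"ROC-AUC on toy data: {auc:.3f}")
```

The toy numbers are chosen only to exercise the code; they say nothing about the discrimination reported in the abstract. Swapping `min` for `max` or a mean in `iptm_min` gives the other aggregation choices one might compare.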
Lin, Y.; Lee, M.; Vermani, A.; Jiang, E.; De Cooman, S.; Spetko, M.; AlQuraishi, M.
Despite the breakneck pace of progress in protein design methodology, frontier problems remain challenging, with leading methods struggling to design high-affinity binders, scaffold multiple functional motifs, or stabilize large multi-domain proteins. Recent research efforts have focused on two areas: improving model reasoning when generating active sites or binding interfaces, and improving concordance between the design process and the in silico oracle used to select promising designs. In addressing the first, the field has shifted towards all-atom models that capture sidechain conformations in atomistic detail by eschewing data-efficient SE(3)-equivariance, mirroring the evolution of AlphaFold2 to AlphaFold3. In addressing the second, recent work has focused on replacing generative models employing diffusion or flow-matching with hallucination approaches that directly optimize the oracle in sequence space; this improves success rates but reduces computational efficiency. Here, we close and surpass the generation-hallucination gap by revisiting SE(3)-equivariance using a branched polymer treatment of protein structures. The resulting diffusion model, Genie 3, achieves state-of-the-art performance on binder design, motif scaffolding, and unconditional generation, while being significantly faster than the best existing methods. We use Genie 3 to design a nanomolar binder of Nipah Glycoprotein G, a tetramer with minimal structural or biophysical characterization, as part of the Adaptyv Bio Nipah Competition, achieving a 12.5% success rate. Taken together, our results present a new frontier in protein design capability and a reexamination of the role of SE(3)-equivariance in molecular modeling.
Dohi, E.
We screened a 5 receptor x 7 aptamer = 35-cell cross-target matrix with HADDOCK3 [1] under blind ambiguous-interaction-restraint (AIR) protocols on AlphaFold-modelled receptors. The screen surfaced 12 operationally distinct failure modes (collapsing to ~8 conceptual classes; §3.1). The K_D-calibration subset is n = 4 cells with literature K_D records under matched assay conditions; the broader cohort includes ≥ 6 biological cognate or intended-cognate cells. The principal case study is P01031 (complement C5, 1676 aa, ≥ 12 structural domains): all 7 panel members produced positive HADDOCK3 top-1 scores under a scale-adaptive AIR. Score-term decomposition locates the anomaly in the AIR term (+217 to +268 to top-1 score). With AIR zeroed, scores fall to -131 to -74 -- the small-receptor regime. Boltz-2 cofolding chain-pair ipTM (cpi_AB) is an independent channel: P01031 shows the lowest median cpi_AB (0.211; 0/7 above the 0.5 confident-interface threshold). To our knowledge, this is the first reported case study of a 1676 aa multi-domain receptor exhibiting this signature under blind scale-adaptive AIR -- an n = 1 mechanistic case, not a statistical generalisation. We adapt the QSAR applicability domain concept [14-16] to in silico aptamer screening. §3.7 reports an empirical Mode 1 mitigation (pLDDT-aware AIR prefilter; cohort Jaccard recovery ~10x).
Spinner, A.; Notin, P.; Berry, S.; Cortade, D.; Sisson, Z.; Ikonomova, S.; Ross, D.; Marks, D.
Generative models are increasingly used for protein design, but the lack of standardized evaluation frameworks limits comparison across model classes and hinders translation to experimental success. Here, we introduce a unified sampling and benchmarking framework that enables controlled sequence generation across alignment, protein language, and structure-based models, and apply it to Tobacco etch virus (TEV) protease. Across hundreds of thousands of designed sequences, different models explore distinct regions of sequence space with no clear computational selection metrics to assess enzymatic function. Experimental evaluation reveals large differences in functional outcomes, ranging from non-functional variants to sequences with 9-fold higher activity than wildtype. Machine learning-designed libraries achieve a 39.32% hit rate (percentage of variants matching or exceeding wildtype activity) compared to 6.06% for an error-prone PCR baseline. Structure-based models perform best overall, with hit rates of 74.4% and 66.8% for ESM-IF1 and ProteinMPNN, respectively. Commonly used selection metrics do not strongly correlate with experimental activity, highlighting a gap between in silico evaluation and enzyme function. Together, these results establish a generalizable framework for benchmarking generative protein models and demonstrate the necessity of experimental validation for guiding model development and sequence prioritization.
Condruti, R.; Muthuraj, L.; Prakash, J. K.; Littman, S. D.; Kumar R., P.; Nair, N. U.
In Anabaena variabilis (Trichormus variabilis) phenylalanine ammonia-lyase (AvPAL), a conserved lid-like loop sits over the active site and has been studied both for its role in positioning a catalytic tyrosine and for its contribution to phenylalanine aminomutase (PAM) activity. While the active site architecture and substrate specificity of AvPAL have been extensively characterized, the dynamic behavior of this unstructured loop beyond its role in catalysis remains poorly understood. Here, we investigate the functional role of this loop by restricting its mobility through targeted interchain disulfide bond engineering. Three in-house approaches were designed to predict ideal cysteine residue pairs: (i) quantifying pair interaction energies via electrostatic and van der Waals forces, (ii) generating a contact map of residues within 5 Å proximity, and (iii) implementing a machine-learning model trained on datasets from PDBCYS, SPX, and an internal database to rank cysteine pair likelihood within disulfide bond geometric constraints. Our machine-learning-guided strategy yielded a successful variant with complete oxidation efficiency in E. coli. Rigidification of this loop reveals that it also functions as a regulator of substrate specificity. Multi-scale molecular simulation analyses (molecular dynamics, metadynamics, quantum/molecular mechanics) reveal that this modification alters the active-site pocket by reducing the conformational dynamics of substrate binding. Our findings underscore the delicate balance between enzyme flexibility and catalytic efficiency, providing novel insights into the role of this understudied dynamic loop region in AvPAL.
Ai, Y.; He, Y.; Zhao, L.; Li, M.; Wang, Y.; Zhou, J.; Lu, H.; Yu, Y.
Human N-glycoproteins constitute a market worth hundreds of billions of dollars. However, their production in yeast is often limited by misfolding and subsequent degradation, largely due to differences in N-glycan-dependent protein quality control (QC) systems between humans and yeast. Notably, yeast lacks the UGGT-mediated reglucosylation-refolding cycle that rescues misfolded glycoproteins, and its degradation pathway involves fewer rate-limiting steps. To address this, we engineer the glycoprotein QC system in Kluyveromyces marxianus, a promising host for protein production, by introducing key human components and modifying native pathways. Expression of human UGGT1 or UGGT2 enhances the soluble and secretory production of glycoproteins in an activity-dependent manner. This effect is further improved by co-expression of the UGGT cochaperone SEP15 and by reducing native glucosidase II trimming activity. In addition, introduction of human EDEM2, a rate-limiting enzyme in glycoprotein degradation, delays ER-associated degradation and increases secretion. Integration of these engineering strategies substantially enhances the production of several high-value human-derived glycoprotein therapeutics, including etanercept, dulaglutide, and abatacept, with up to a ~12-fold increase. These findings demonstrate that engineering a human-like glycoprotein QC network in yeast is an effective strategy to improve glycoprotein folding and secretion.
Cha, H.; Cho, K.; Gu, J.; Gwak, D.; Ham, S. W.; Hong, M.; Kim, S.; Kim, S.; Kwon, S.; Lee, C.; Lee, D. K.; Lee, D.; Lee, D.; Lim, J.; Noh, J.; Oh, S.; Park, E.; Park, S.; Park, T.; Ryu, E.; Ryu, S.; Sa, D. H.; Seok, C.; Sim, J.; Song, M. Y.; Won, J.; Woo, H.; Yang, J.
The precise de novo design of antibodies remains a therapeutic challenge. The AI platform, GaluxDesign, was evaluated in a high-efficiency Precision-Scale Workflow by synthesizing and testing only 50 full-length IgG candidates per epitope across eight distinct epitopes from six therapeutic targets. This campaign yielded a 10.5% binder rate (estimated EC50 < 100 nM), identifying target-specific binders for seven of eight epitopes, with multiple candidates exhibiting sub-nanomolar to single-digit nanomolar dissociation constants (Kd). We further assessed the same workflow on nine shared benchmark targets selected for external comparison, where GaluxDesign identified target-specific binders for eight of nine targets, demonstrating strong target-level performance relative to previously reported de novo antibody design approaches. Together, these results establish a high-efficiency, precision-scale workflow for generating novel, high-affinity therapeutic antibodies.
Weiner, I. N.
Cetuximab is a chimeric IgG1 monoclonal antibody that has been a cornerstone therapy for EGFR-driven malignancies for nearly two decades. Its therapeutic activity is governed by competitive displacement of endogenous EGFR ligands, making binding affinity a direct determinant of clinical efficacy. We applied ConvergeAB, a target-aware antibody design platform, in a fully zero-shot configuration to generate a biobetter version of cetuximab. The lead Converge-designed antibody binds EGFR with a mean KD of 315 pM -- approximately 2.1-fold tighter than cetuximab (673 pM) and 4.4-fold tighter than a recently published, computationally designed anti-EGFR antibody from Cradle Bio (1.38 nM). The affinity gain arises from six substitutions that leave the global paratope architecture intact (Cα RMSD 0.15 Å vs cetuximab) and instead optimize the binding interface through localized packing and electrostatic adjustments. A panel of biophysical and developability assays -- HIC, DLS, DSF, and PSR ELISA -- shows that the Converge variant matches or exceeds cetuximab on monomericity, monodispersity, polyspecificity, and thermal stability, while remaining within a developable hydrophobicity envelope. Together, these data demonstrate that a single zero-shot ConvergeAB campaign can deliver a biobetter molecule with significantly improved affinity and a clean developability profile, without compromising the parental antibody's drug-like properties.
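The fold-improvement figures quoted above follow directly from the ratio of dissociation constants, and a short check makes the arithmetic explicit. The three KD values are taken from the abstract; the function and variable names are illustrative only.

```python
def fold_tighter(reference_kd_pm, candidate_kd_pm):
    """How many times tighter the candidate binds (ratio of KD values in pM)."""
    return reference_kd_pm / candidate_kd_pm


LEAD_KD_PM = 315.0        # Converge-designed lead, from the abstract
CETUXIMAB_KD_PM = 673.0   # cetuximab, from the abstract
CRADLE_KD_PM = 1380.0     # 1.38 nM expressed in pM, from the abstract

vs_cetuximab = fold_tighter(CETUXIMAB_KD_PM, LEAD_KD_PM)
vs_cradle = fold_tighter(CRADLE_KD_PM, LEAD_KD_PM)
print(f"{vs_cetuximab:.1f}-fold vs cetuximab, {vs_cradle:.1f}-fold vs Cradle Bio")
```

Both ratios round to the values stated in the abstract (2.1-fold and 4.4-fold).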
Bajgain, Y.; Guo, M.; Hager, K. M.; Nguyen, A. W.; Zhang, Y.; Maynard, J. A.
Antibody-dependent cellular cytotoxicity (ADCC) is a major mechanism of action for many FDA-approved therapeutic antibodies that is driven by interactions between the antibody Fc and Fcγ receptors (FcγRs) on immune effector cells. Murine models used for preclinical antibody evaluation currently have limited predictive value for clinical ADCC performance due to interspecies differences in Fc-FcγR interactions. The molecular determinants governing Fc-FcγR engagement in mice remain poorly defined, complicating the interpretation of murine ADCC data and its clinical relevance. To address this, we present the high-resolution crystal structure of the receptor that regulates Fc-mediated cytotoxicity in mice, mouse FcγRIV, alone and in complex with mouse IgG2a Fc. This complex preserves key features of the human IgG1 Fc-human FcγRIIIa interface, which mediates ADCC in humans, including salt bridges, hydrogen bonds, and a proline sandwich. However, subtle variations in receptor orientation, Fc-FcγR electrostatics, and glycan positions reduce human IgG1 Fc-mouse FcγRIV binding affinity, resulting in species-restricted Fc-FcγR mediated immune responses. Modeling of human IgG1 Fc interactions with mouse FcγRIV predicted steric clashes, suggesting opportunities to modulate the interaction. One structure-guided substitution variant of human IgG1, Fchumo, maintains comparable human FcγRIIIa engagement with enhanced binding to and activation of mouse FcγRIV, relative to human IgG1 Fc. This study provides proof-of-concept for engineering human Fc domains for cross-species FcγR recognition and provides a strategic framework to improve the predictive power of in vivo preclinical models.
Obendorf, L.; Doering, N. P.; Knaus, P.; Wolber, G.
AI-driven cofolding models have emerged as powerful tools for predicting protein-ligand complexes, yet whether ligand placement faithfully captures the conformational states of dynamic proteins remains unclear. Here we show that cofolding adaptively remodels binding pockets around bound ligands, but that this local accuracy is frequently decoupled from recovery of the broader conformational state. We benchmark four models, AlphaFold3, RosettaFold3, Boltz-2, and Chai-1, against a set of kinases and class A G protein-coupled receptors (GPCRs), protein families whose pharmacology depends on well-defined structural states. We find that even when ligand root-mean-square deviation (RMSD) is low, critical state markers, including kinase activation-loop geometries and GPCR intracellular arrangements, are frequently mispredicted. Incorporating state-annotated templates and filtered multiple sequence alignments (MSAs) improves conformational recovery in selected cases, yet weakly impacts others. Furthermore, while orthosteric ligand placement is generally reliable, allosteric binders expose a consistent blind spot across all models. These findings establish conformational decoupling as a fundamental limitation of current cofolding approaches, with direct implications for state-selective drug design.
Sakurai, A.; Shoji, K.; Ichihashi, N.
Improving the reconstituted translation system is a key requirement for bottom-up synthetic biology. Here, we developed a two-step in vitro evolutionary method that can be used for improving translational proteins. In this method, two distinct conditions were sequentially applied while maintaining genotype-phenotype linkage in water-in-oil droplets. Using this method, we performed in vitro evolution of four translation factors, IleRS, PheRS, EF-G, and EF-Tu, and identified mutations that modestly enhanced translation activity in in vitro expression assays. One of the EF-G mutations (P610S) increased activity per protein approximately 2-fold for the recombinant protein purified from E. coli. This selection method is useful for improving translational proteins for bottom-up synthetic biology.
Aldas-Bulos, V. D.; Plisson, F.
Machine learning continues to accelerate peptide and protein design through the rapid prediction and generation of sequences with desired characteristics. Many applications focus on predicting properties, functions, and structures, as well as generating point mutations and de novo designs. Nevertheless, many models prove less generalizable than initially claimed. Most predictors and generators are trained on sequential datasets, where imbalances can be addressed during preprocessing. In contrast, structural bias, a subtype of algorithmic bias arising from uneven representation of structural classes in training datasets, and the limitations of early protein structure predictors have frequently remained undetected and uncorrected. The recent surge in powerful protein structure prediction tools, such as the AlphaFold and RosettaFold series and their variants, now presents opportunities to mitigate this issue. We hypothesize that such structural sampling biases influence the downstream performance of ML models. Using antimicrobial peptides as a case study, we audited the structural biases in 16 state-of-the-art predictors for antimicrobial activity and tested whether structural information constrains their predictions. Our analysis revealed that models explicitly trained on sequential data still produce predictions biased by uneven fold representations and data leakage. These findings highlight the importance of integrating balanced structural data or implementing bias-mitigating strategies to develop agnostic models that maximize bioactive protein discovery and multi-objective optimization.